
In March 2020 the direction of the country began to change. The COVID-19 Pandemic raged through the country and lockdowns began. The result of this was a near standstill of the economy. As consumer demand decreased and businesses shut down, labor markets began to feel the heat. The combination of economic impacts and an increase in workplace hazards resulted in a massive spike in the unemployment rate.
The national rate reached a previously unseen level of 14.7 percent in April 2020. In response to this, Congress passed the CARES Act which contained provisions for enhanced and extended unemployment insurance benefits. While necessary for many to keep afloat during the economic crisis, a potential drawback of this policy is that workers may be disincentivized to work. If the benefit level is sufficiently high, returning to work may pay less money than remaining on unemployment insurance. A rational actor would see this and decide to stay on unemployment insurance and not return to the workforce.

The key question to address then becomes: Which population segment in the United States of America have seen more population disincentivization to return to work due to the Unemployment Insurance under COVID-19 after evaluating weekly data since 2020?
Pooled cross section is a data structured where random samples are obtained from the population at different points in time. Pooled Cross Sections are obtained by collecting random samples from a large population independently at different points in time. Having random samples collected independently implies that they need not be of equal size and will usually contain different statistical units at different points in time.Therefore, limiting serial correlation of residuals when regression analysis is applied. The previous is useful for a variety of purposes, including policy analysis. For example, to study the effects of an unexpected change in unemployment insurance we can choose the treatment group to be unemployed individuals from a state that has a change in unemployment compensation, while a control group could be unemployed workers from a neighboring state. Since each observation is independent, the regression model of pooled cross section is as followed:

Since April 2020, the U.S. Census Bureau, in collaboration with multiple federal agencies, held the Household Pulse Survey(HPS). The main purpose of the survey was to be able to deploy quickly and efficiently, collected data to be able to measure household experiences during the COVID-19 pandemic.
The Survey consist of a 20-minute, online survey which is meant to assess how individuals, families, and communities are being affected by the COVID-19 pandemic. Participants are drawn from randomly selected addresses, and respondents may be asked to complete the survey up to three times to track the effect of the pandemic on households over time.The survey collects data from an average of around 97,000 respondents per week. The Survey continues measuring how the COVID-19 pandemic is impacting households across the US from a social and economic perspective. The phases were planned. Results for Phase 1 were released weekly, and results from Phases 2 and 3 were released every two weeks. The details of each data collection was as followed.
The details for the dataset can be found in the following link: https://www.census.gov/programs-surveys/household-pulse-survey/datasets.html.
A complementary dataset that was also analyzed and incorporated was the Unemployment Insurance replacement rates from the Department of Labor of US available in the following link: https://oui.doleta.gov/unemploy/ui_replacement_rates.asp
The Replacement Rate is the ratio of the claimants’ weekly benefit amount (WBA) to the claimants’ average weekly wage. Two key Replacement Ratios are available:
The HPS provides many advantages in comparison to other datasets that have measured unemployment insurance(i.e: the unemployment insurance of the labor department). Among the advantages that we can mention are:
Some of the initial variables we are using from the dataset include:
Additionally, using the average replacement ratio will allow to further understand the motivation of workers to either return to the labor market given that with only the unemployment insurance benefits they could not be able to truly alleviate severe economic hardship, particularly for low-wage workers.
We would now initiate our Exploratory Data Analysis (EDA) in the following section, considering that our previously stated research question is:
#### Cleaning the data####
# Setting working directory
setwd('/Users/admin/Documents/GitHub/DS6101_G3/Data')
#import and merge all CSV files into one data frame
df_panel <- list.files(path='/Users/admin/Documents/GitHub/DS6101_G3/Data') %>%
lapply(read_csv) %>%
bind_rows
# remove additional unnecessary rows
df_panel<- df_panel[ -c(26:50) ]
df_panel<- df_panel[ -c(26) ]
#creating the panel
df_panel <- df_panel[ -c(4:5)]
df_panel<- panel_data(df_panel, id = SCRAM, wave = WEEK)
#renaming columns
colnames(df_panel)[1:23]=c("ID","Week","State","Age","Hispanic","Other_race","Edu","Sex","p_hh","wrklossrv", 'anywork','kindwork','rsnnowrkrv','UI_apply', 'UI_recrv','UI_recvnow','house', 'income','SPND1','SPND2','SPND3','SPND4','SPND5' )
#Changing birth year to age
df_panel$Age <- 2022-df_panel$Age
#Changing -99 and -88 for NAs
df_panel <- df_panel %>% dplyr::na_if(-99)
df_panel <- df_panel %>% dplyr::na_if(-88)
#Reconverting to panel data
df_panel2<- panel_data(df_panel, id = ID, wave = Week)
df_panel2
# Changing state names first run
rep_str_state = c('10'='Delaware',
'11'='District of Columbia',
'12'='Florida',
'13'='Georgia',
'15'='Hawaii',
'16'='Idaho',
'17'='Illinois',
'18'='Indiana',
'19'='Iowa',
'20'='Kansas',
'21'='Kentucky',
'22'='Louisiana',
'23'='Maine',
'24'='Maryland',
'25'='Massachusetts',
'26'='Michigan',
'27'='Minnesota',
'28'='Mississippi',
'29'='Missouri',
'30'='Montana',
'31'='Nebraska',
'32'='Nevada',
'33'='New Hampshire',
'34'='New Jersey',
'35'='New Mexico',
'36'='New York',
'37'='North Carolina',
'38'='North Dakota',
'39'='Ohio',
'40'='Oklahoma',
'41'='Oregon',
'42'='Pennsylvania',
'44'='Rhode Island',
'45'='South Carolina',
'46'='South Dakota',
'47'='Tennessee',
'48'='Texas',
'49'='Utah',
'50'='Vermont',
'51'='Virginia',
'53'='Washington',
'54'='West Virginia',
'55'='Wisconsin',
'56'='Wyoming')
df_panel2$State<- str_replace_all(df_panel2$State,rep_str_state)
# Changing state names second run
rep_str_state2 = c('1'='Alabama',
'2'='Alaska',
'4'='Arizona',
'5'='Arkansas',
'6'='California',
'8'='Colorado',
'9'='Connecticut')
df_panel2$State<- str_replace_all(df_panel2$State,rep_str_state2)
# removing NA's from ID column in dataframe
df_panel2<-df_panel2[!is.na(df_panel2$ID),]
df_panel2<-df_panel2[!is.na(df_panel2$State),]
#Adding mean replacement rate column by state and week
df_panel2$Week2[df_panel2$Week %in% 1:9]<- '1-9'
df_panel2$Week2[df_panel2$Week %in% 10:15]<- '10-15'
df_panel2$Week2[df_panel2$Week %in% 16:21]<- '16-21'
df_panel2$Week2[df_panel2$Week %in% 22:27]<- '22-27'
df_panel2$Week2[df_panel2$Week %in% 28:33]<- '28-33'
df_panel2$Week2[df_panel2$Week %in% 34:38]<- '34-38'
df_panel2$Week2[df_panel2$Week %in% 39:40]<- '39-40'
df_panel2$Week2[df_panel2$Week %in% 41:43]<- '41-43'
df_panel2$Week2[df_panel2$Week %in% 44:49]<- '44-49'
#Importing replacement rate data CSV
setwd('/Users/admin/Documents/GitHub/DS6101_G3')
rr<- read_csv("Replacement_rate.csv")
#Adding Mean replacement rate to data panel
df_panel3<- left_join(df_panel2, rr, by= c('Week2'='week','State'))
df_panel3
#removing additional unnecessary columns
df_panel3 <- as.data.frame(df_panel3)
df_panel3 <- df_panel3[ -c(19:24) ]
#copying df2 to df3
df_panel2<-df_panel3
#Changing categorical values for traditional 0 & 1 triple check this with dictionary
df_panel2$Hispanic[df_panel2$Hispanic==2]<- 0
df_panel2$Sex[df_panel2$Sex==2]<- 'Female' #Changing to categorical
df_panel2$Sex[df_panel2$Sex==1]<- 'Male' #Changing to categorical
df_panel2$Edu[df_panel2$Edu==1]<- 'Less than high school' #Changing to categorical
df_panel2$Edu[df_panel2$Edu==2]<- 'Some high school' #Changing to categorical
df_panel2$Edu[df_panel2$Edu==3]<- 'High school grad or equiv' #Changing to categorical
df_panel2$Edu[df_panel2$Edu==4]<- 'Some college no degree received' #Changing to categorical
df_panel2$Edu[df_panel2$Edu==5]<-"Associate's degree" #Changing to categorical
df_panel2$Edu[df_panel2$Edu==6]<- "Bachelor's degree" #Changing to categorical
df_panel2$Edu[df_panel2$Edu==7]<- 'Graduate degree' #Changing to categorical
#df_panel2$income[df_panel2$income==1]<- 'Less than $25,000' #Changing to categorical
#df_panel2$income[df_panel2$income==2]<- '$25,000 - $34,999' #Changing to categorical
#df_panel2$income[df_panel2$income==3]<- '$35,000 - $49,999' #Changing to categorical
#df_panel2$income[df_panel2$income==4]<- '$50,000 - $74,999' #Changing to categorical
#df_panel2$income[df_panel2$income==5]<- '$75,000 - $99,999' #Changing to categorical
#df_panel2$income[df_panel2$income==6]<- '$100,000 - $149,999' #Changing to categorical
#df_panel2$income[df_panel2$income==7]<- '$150,000 - $199,999' #Changing to categorical
#df_panel2$income[df_panel2$income==8]<- '$200,000 and above' #Changing to categorical
df_panel2$income2[df_panel2$income==1]<- 'Less than $25,000' #Changing to categorical
df_panel2$income2[df_panel2$income==2]<- '$25,000 - $34,999' #Changing to categorical
df_panel2$income2[df_panel2$income==3]<- '$35,000 - $49,999' #Changing to categorical
df_panel2$income2[df_panel2$income==4]<- '$50,000 - $74,999' #Changing to categorical
df_panel2$income2[df_panel2$income==5]<- '$75,000 - $99,999' #Changing to categorical
df_panel2$income2[df_panel2$income==6]<- '$100,000 - $149,999' #Changing to categorical
df_panel2$income2[df_panel2$income==7]<- '$150,000 - $199,999' #Changing to categorical
df_panel2$income2[df_panel2$income==8]<- '$200,000 and above' #Changing to categorical
df_panel2$wrklossrv[df_panel2$wrklossrv==2]<- 0
df_panel2$anywork[df_panel2$anywork==2]<- 0
df_panel2$UI_apply [df_panel2$UI_apply ==2]<- 0
df_panel2$UI_recrv[df_panel2$UI_recrv==2]<- 0
df_panel2$UI_recvnow[df_panel2$UI_recvnow==2]<- 0
#Importing regions data CSV
setwd('/Users/admin/Documents/GitHub/DS6101_G3')
reg<- read_csv("US_regions.csv")
#Adding regions to data panel
df_panel2<- left_join(df_panel2, reg, by= c('State'))
#Number of people repeated in sample
count_reps<- df_panel2 %>%
group_by(ID) %>%
summarise(n=n())
N_rep<- count_reps %>%
filter(n > 1)
# removing ids column
df_panel2<- df_panel2[ -c(1) ]
#converting data frame to panel
df_panel2<- panel_data(df_panel2, id = State, wave = Week)
#Sub-setting the data sample based to remove persons with an income greater than and that haven't been unemployed
df_panel2 <- subset(df_panel2, income != 7 & income != 8 & wrklossrv != 0 & anywork != 1)
> #Checking the type of panel balanced or unbalanced
>
> print('Information about the panel:')
> pdim(df_panel2)
>
> # Amount of unique values
> print('Amount of unique values:')
> length(unique(df_panel2$ID))
>
>
> #Number of people repeated in sample
> print('Number of people repeated in sample: ')
> N_rep
>
> #Checking panel structure
> #str(df_panel2)
> glimpse(df_panel2)
> xkablesummary(df_panel2, title="Summary Statistics",pos='left', bso= 'condensed')
Evaluating the cross sectional panel, we were able to see that the panel is unbalanced and it has more than 3 million observations. The sample size was reduced when we eliminated the persons that had an income higher than $150,000 and that did not lose employment, given that they would not be the most vulnerable group affected by the pandemic.This reduced the sample size to close to 300 thousand observations. As it was previously mentioned, most of the people in our sample were not interviewed more than three times. Given that the characteristic of cross- sectional data solves normality, rather than focusing in doing normality tests of the data, we focused in understanding it further and the linkages that they seem to be present between variables. Taking a careful look at our data set is relevant so we can better understood the best approach for a regression analysis. The following section highlights some of the key plots
#** Interviewed sample per sex and state**
theme_set(theme_classic())
ss <- ggplot(df_panel2, aes(x=fct_infreq(State)))
ss + geom_bar(aes(fill=Sex), width = 0.7) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Sample per sex and state",
subtitle="Number of persons",
x= 'States',
y= 'Frequency')
When analyzing the data we were able to find that California, Texas and Florida had the higher amount of unemployed population, while South Dakota and North Dakota the lowest. Gender distribution seems to be similar across states.
# **Box plot per age and state**
ggplot(aes(x=Sex, y=Age, fill= Sex), data=subset(df_panel2, !is.na(Sex))) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=4)+
labs(title="Box Plot Age/ Sex",
x= 'Sex',
y= 'Age')+
theme(legend.position = 'none')
Regarding age and sex, the mean age of males was higher than that of females, with no outliers visible in the dataset.
e1 <- ggplot(data = df_panel2,
mapping = aes(x = Region, fill = Edu))
e1 + geom_bar(position = "dodge",
mapping = aes(y = ..prop.., group = Edu))+
labs(title="Levels of education of the sample",
subtitle= 'by US region',
x= 'Education',
y= 'Proportion of the sample')
Per region, the West region shows the higher amount of population with less than high school certificates, while the Northeast shows a higher percentage of persons with graduate degrees. The South and West region seems to have a higher concentration of population. In the Midwest region associate and graduate degree has the highest concentration of population.
replacerate1 <- ggplot(df_panel2, aes(x=Region, y = `Average of Sum of Replacement Ratio 2`, fill = Region)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=4) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.grid.major.y = element_line(color = "lightgray"),
axis.line = element_line(color = "black"),
axis.ticks.y=element_blank(),
axis.ticks.x=element_blank(),
axis.line.y = element_blank(),
axis.line.x = element_blank(),
legend.position = "none") +
labs(x = "Region", y = "Average of Replacement Ratio ", title = "Average of Replacement Ratio ", subtitle = "By Region")
replacerate1
The replacement rate is higher in the Northeast and West. The higher the replacement rates the more they can alleviate economic hardship. The south region is the one that shows the lowest replace ratio.
library(ggbeeswarm)
ggplot(df_panel2, aes(y = income, x = Other_race, color = Other_race)) +
geom_quasirandom(alpha = 0.5) +
theme_minimal(base_size = 13) +
scale_color_viridis_c(guide = "none") +
scale_y_discrete(limits = c(1:6), labels =c('Less than $25,000', '$25,000 - $34,999', '$35,000 - $49,999', '$50,000 - $74,999', '$75,000 - $99,999', '$100,000 - $149,999')) +
scale_x_discrete(limits = c(1:4), labels =c('White', 'Black', 'Asian', 'Other')) +
labs(y = "Income",
x = "Race",
title = "Household income distribution by largest racial group")
library(gganimate)
#Finding percentage of people that are unemployed and received the benefit
non<-df_panel2[!is.na(df_panel2$UI_recrv),]
t1<-non %>%
filter(UI_recrv == 1) %>%
group_by(Week,State) %>%
count()
colnames(t1)<- c('State', 'Week','UI')
t0<-non %>%
filter(UI_recrv == 0) %>%
group_by(Week,State) %>%
count()
colnames(t1)<- c('State', 'Week','UIO')
t1_0<- t1 %>% left_join(t0, by= c('State', 'Week'))
t1_0<-t1_0%>% mutate(tot = UIO + n, .after = n)
t1_0<-t1_0%>% mutate(x = UIO /tot, .after = tot)
#Plot
UI_rec <- ggplot(t1_0,
aes(x = Week, y= x, color = State)) +
geom_point()+
transition_manual(Week)+
labs(x = "Weeks", y = "% of unemployed persons receiving unemployment insurance", title = "Change in Unemployment Insurance received by state")
ggplotly(UI_rec)
animate(UI_rec, renderer = gifski_renderer())
anim_save("UI_rec.gif")
#install.packages(c('gapminder','gganimate','gifski'))
#library(gapminder)
#library(gganimate)
#library(gifski)
#ggsave('UI_recieved.png')
#library(gganimate)
#anim_save('UI_rec')
#install.packages("png")
#library(png)
#p1 <- readPNG("UI_recieved.png")
#library(caTools)
#install.packages("magick")
#library(magick)
#write.gif(p1, "UI_recieved.png")
#showGIF <function(fn) system(paste("display",fn))
#showGIF("myPlot.gif")
#unlink("myPlot.gif")
This graph breaks down the percent of unemployment insurance receipt throughout the weeks. Each dot represents the percent of people in the survey for each state that received unemployment Our data begins to captures unemployment insurance being received at week 13. This is likely due to the surge of unemployment claims in the early pandemic causing delays. There is a sharp drop between week 27(March 17 - 29) & week 28 (April 14 - April 26). Beginning the week of April 12, all essential workers in Phase 1C Tier 3 will become eligible for the vaccine.
Sharp decline may have happened due to the expansion of eligibility
for the covid vaccine; April 19, 2021 The White House announced that all
people age 16 and older are eligible for the COVID-19 vaccine. After may
1 everyone was eligible for the vaccine; while there is no direct tested
correlation it’s interesting to point out the decline and the dates of
vaccine eligibility
Steady decline after week 40 which was taken after PUA had ended, the
recording stops in September of this year where recipients are at their
lowest since beginning of data collection.
The amount and frequency of the data in which we are working will allow to be able to avoid statistical mistakes and to be able to obtain a robust regression analysis that could provide relevant insight on unemployment. Having seen how sex and education change through the sample through state and the different percentage of unemployed receiving the benefit, using sex, education, and age as control variables will allow us in the next steps to obtain relevant conclusions that can contribute in informing employment policy in US.
It is important to mention that although our approach uses data that can represent fairly well the US population under unemployment and receiving unemployment insurance there are still some limitations associated to it:
Once the data has been more carefully understood the next steps will required to analyze the type of regression analysis that we would need to undertake. Some of the options can included difference in difference regression, fixed and random effects.Dummy variables will also be selected in order to test difference in intercept terms or slope coefficient, allow the intercept to have different values in each period.
| Variable | Codes | Description |
|---|---|---|
| Hispanic | 0 1 |
Hispanic Non Hispanic |
| Other Race | 1 2 3 4 |
White Black Asian Any other race |
| Education | 1 2 3 4 5 6 7 |
Less than high school Some high school High school grad or equiv Some college no degree received Associate’s degree Bachelor’s degree Graduate degree |
| Sex | 0 1 |
Female Male |
| p_hh | 1-40 |
Number of people in household |
| wrklossrv | 0 1 |
Recent household job loss No Yes |
| anywork | 0 1 |
Employment status for last 7 days No Yes |
| kindwork | 1 2 3 4 5 |
Sector of Employment Government Private company Non profit Self-employed Working in family business |
| rsnnowrkrv | 1 2 3 4 5 6 7 8 9 10 11 12 |
Main reason for not working Didn’t want to be employed Sick or caring for COVID Caring for children Caring for an elderly Concerned getting/spreading COVID Sick (no COVID)or disabled Retired Laid off due to COVID Employer closed temporarily due to COVID Employer went out of business due to COVID Did not have transportation to work Other reason |
| UI_apply | 0 1 |
UI Apply No Yes |
| UI_recrv | 0 1 |
UI Receive No Yes |
| UI_recvnow | 0 1 |
|UI Receive now | No | Yes |
| house | 0 1 |
Housing owned or rented No Yes |
| income | 1 2 3 4 5 6 7 8 |
Total household income Less than $25,000 $25,000 - $34,999 $35,000 - $49,999 $50,000 - $74,999 $75,000 - $99,999 $100,000 - $149,999 $150,000 - $199,999 $200,000 and above |
| Mean RR 1 | Replacement Ratio 1=Weighted Avg
of:WBA/(Normal Hourly Wage x 40 Hrs.) |
| Mean RR 2 | Replacement Ratio 2=Ratio of: Weighted Avg
WBA/Weighted Avg (Normal Hourly Wage x 40| Hrs.) |
| Region | States |
|---|---|
| Northeast | Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont, New Jersey, New York, Pennsylvania |
| South | Delaware, District of Columbia, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, West Virginia, Alabama, Kentucky, Mississippi, Tennessee, Arkansas, Louisiana, Oklahoma, Texas |
| Midwest | Illinois, Indiana, Michigan, Ohio, Wisconsin, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, South Dakota |
| West | Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, Wyoming, Alaska, California, Hawaii, Oregon, Washington |